
Track per-phase duration on each instance #223

Merged
sjmiller609 merged 9 commits into main from hypeship/instance-phase-tracking on May 12, 2026

@sjmiller609 sjmiller609 commented May 11, 2026

Collaborator

Summary

Adds a small phasetracking package under lib/instances/phasetracking that records how long each instance has spent in each lifecycle phase (running, standby, paused, stopped, etc.). Tracker state lives on StoredMetadata, so it persists with the existing metadata.json per instance — no schema migration.

Instrumentation is done directly at the transition sites in create/start/stop/standby/restore/fork. We intentionally do not subscribe to the existing lifecycle event stream — that pipeline is lossy and these numbers will feed billing, so we want a direct write at the point of transition.

Each Instance API response now carries:

  • current_phase
  • current_phase_since
  • phase_durations_ms (cumulative ms per phase, with the live phase counted up through now)
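
As a rough illustration of the transition bookkeeping behind these fields, here is a minimal sketch. The Tracker shape, the Phase constants, and the DurationsMs helper are assumptions made for the example; only the names Record, Current, Since, and Cumulative appear elsewhere in this PR, and the real package may differ.

```go
package main

import (
	"fmt"
	"time"
)

// Phase names a lifecycle phase; the string values are illustrative.
type Phase string

const (
	PhaseRunning Phase = "running"
	PhaseStandby Phase = "standby"
)

// Tracker is an assumed shape for the state persisted on StoredMetadata.
type Tracker struct {
	Current    Phase
	Since      time.Time
	Cumulative map[Phase]time.Duration
}

// Record credits the time since the last transition to the current phase,
// then enters next. A non-positive elapsed time is not accrued.
func (t *Tracker) Record(next Phase, at time.Time) {
	if t.Cumulative == nil {
		t.Cumulative = make(map[Phase]time.Duration)
	}
	if t.Current != "" {
		if elapsed := at.Sub(t.Since); elapsed > 0 {
			t.Cumulative[t.Current] += elapsed
		}
	}
	t.Current, t.Since = next, at
}

// DurationsMs (hypothetical helper) returns cumulative milliseconds per
// phase, counting the live phase up through now without mutating the tracker.
func (t *Tracker) DurationsMs(now time.Time) map[Phase]int64 {
	totals := make(map[Phase]time.Duration, len(t.Cumulative)+1)
	for p, d := range t.Cumulative {
		totals[p] = d
	}
	if t.Current != "" {
		if live := now.Sub(t.Since); live > 0 {
			totals[t.Current] += live
		}
	}
	out := make(map[Phase]int64, len(totals))
	for p, d := range totals {
		out[p] = d.Milliseconds()
	}
	return out
}

// demoDurations walks running -> standby -> running and snapshots one
// minute into the second running phase.
func demoDurations() map[Phase]int64 {
	t0 := time.Date(2026, 5, 11, 12, 0, 0, 0, time.UTC)
	var tr Tracker
	tr.Record(PhaseRunning, t0)
	tr.Record(PhaseStandby, t0.Add(90*time.Second))
	tr.Record(PhaseRunning, t0.Add(5*time.Minute))
	return tr.DurationsMs(t0.Add(6 * time.Minute))
}

func main() {
	fmt.Println(demoDurations()) // map[running:150000 standby:210000]
}
```

The snapshot is computed on read, so a response can report up-to-date totals without writing metadata on every API call.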

Why

Today the billing pipeline that consumes hypeman instances charges based on wall-clock since CreatedAt, which double-counts intra-session standby and any other non-running time. To fix that properly we need hypeman itself to report real per-phase durations so the upstream billing calculation can compute true running time. Once this ships, the kernel-api side can switch the formula for hypeman CPU back to a platform-uptime model.
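
To make the difference concrete, the sketch below decodes the three new fields from a made-up response body and compares wall-clock time since creation with the reported running time. The payload values, the struct, and the helper names are all hypothetical; the JSON field names come from this PR, but their exact wire types are an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// instancePhases mirrors only the new fields; the types are assumed.
type instancePhases struct {
	CurrentPhase      string           `json:"current_phase"`
	CurrentPhaseSince time.Time        `json:"current_phase_since"`
	PhaseDurationsMs  map[string]int64 `json:"phase_durations_ms"`
}

// samplePayload is illustrative data, not output captured from hypeman.
const samplePayload = `{
  "current_phase": "running",
  "current_phase_since": "2026-05-11T12:07:30Z",
  "phase_durations_ms": {"running": 240000, "standby": 360000}
}`

// wallClockMs is the old-style basis: everything since creation.
func wallClockMs(createdAt, now time.Time) int64 {
	return now.Sub(createdAt).Milliseconds()
}

// runningMs is the new-style basis: only time reported as running.
func runningMs(payload []byte) (int64, error) {
	var p instancePhases
	if err := json.Unmarshal(payload, &p); err != nil {
		return 0, err
	}
	return p.PhaseDurationsMs["running"], nil
}

// demoBilling compares both bases for an instance created ten minutes ago.
func demoBilling() (wall, running int64, err error) {
	createdAt := time.Date(2026, 5, 11, 12, 0, 0, 0, time.UTC)
	now := createdAt.Add(10 * time.Minute)
	running, err = runningMs([]byte(samplePayload))
	return wallClockMs(createdAt, now), running, err
}

func main() {
	wall, running, err := demoBilling()
	if err != nil {
		panic(err)
	}
	fmt.Printf("wall-clock: %d ms, running: %d ms\n", wall, running)
}
```

In this made-up case the wall-clock basis covers 600000 ms while only 240000 ms were spent running, which is the gap the per-phase numbers are meant to close.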

Test plan

  • go build ./...
  • go test ./lib/instances/phasetracking/...
  • go test -run TestInstanceToOAPI ./cmd/api/api/...
  • Manually exercise an instance through start → standby → restore → stop and confirm phase_durations_ms matches expectations

Note

Medium Risk
Touches instance lifecycle transition paths (create/start/stop/standby/restore/fork) and adds new API fields used for billing, so incorrect phase recording could impact cost/analytics calculations despite being additive and well-tested.

Overview
Adds persistent per-instance lifecycle phase accounting via new lib/instances/phasetracking tracker, recording cumulative time spent in phases at each externally observable transition (including special handling to advance initializing → running based on boot markers).

Exposes this data in the Instances API as current_phase, current_phase_since, and phase_durations_ms (snapshotting live time through response time), and updates OpenAPI/generated oapi models accordingly.

Adjusts fork behavior to reset phase history (and deep-clone metadata to avoid shared maps), and expands unit/integration tests to validate phase accrual across create/standby/restore/fork and API emission/omission rules.

Reviewed by Cursor Bugbot for commit 3a95db6.

Add a small phasetracking package that records cumulative time spent in
each lifecycle phase (running, standby, paused, etc.) using transition
bookkeeping. Tracker state is persisted with the instance's stored
metadata, so it survives process restarts without a DB migration.

Instrument transitions directly in create/start/stop/standby/restore/fork
rather than subscribing to the lifecycle event stream — the subscription
is lossy, and these numbers will feed billing.

Expose current_phase, current_phase_since, and phase_durations_ms on the
Instance API so callers (notably the kernel-api billing pipeline) can
compute true running time instead of wall-clock since CreatedAt.

github-actions Bot commented May 11, 2026


✱ Stainless preview builds for hypeman

This PR will update the hypeman SDKs with the following commit message.

feat: Track per-phase duration on each instance

hypeman-openapi
Your SDK build had at least one "note" diagnostic.
generate ✅

⚠️ hypeman-typescript
Your SDK build had a failure in the lint CI job, which is a regression from the base state.
generate ✅ build ✅ lint ❗ test ✅

hypeman-go
Your SDK build had at least one "note" diagnostic.
generate ✅ build ⏭️ lint ✅ test ✅

go get github.com/stainless-sdks/hypeman-go@312c26f13a0c22a91b902b51c13a431452dca79d

Last updated: 2026-05-12 12:40:56 UTC

Piggyback on the firecracker/QEMU standby-restore cycles and the
cloud-hypervisor fork-from-running test to assert end-to-end that
transition-site instrumentation is wired up:

- after standby: Current == standby, Cumulative[running] > 0
- after restore: Current == running, Cumulative[standby] > 0
- after fork-from-running: fork's Cumulative[running] is zero while
  source's is non-zero — locks down the Phases.Reset() semantics

No new tests, no added sleeps. The assertions read state at the same
points where the tests already check State.
@sjmiller609 sjmiller609 marked this pull request as ready for review May 11, 2026 19:40
@firetiger-agent


Monitoring Plan Created

This PR introduces a new phasetracking package that records cumulative wall-clock time in each lifecycle phase (running, standby, stopped, etc.) for every instance. The tracker is embedded in StoredMetadata, updated at every state transition (create, standby, restore, start, stop, fork), and exposed via three new optional fields on all instance API responses: current_phase, current_phase_since, and phase_durations_ms.

The change is purely additive — no existing fields or logic are modified — and pre-existing instance metadata gracefully degrades (zero-value tracker starts accumulating on the first state transition after deploy). The main risks to watch are metadata write failures during state transitions (which could leave on-disk metadata stale) and any disruption to the standby/restore cycle caused by the new Record() calls. Current baselines show 100% hypeman spawn success rate with zero failed invocations over the past 48 hours, deployment p50 of 16–35s, and stable instance creation error counts.

Status updates will be posted automatically on this PR as monitoring progresses.


Comment thread lib/instances/standby.go
The phase tracker's Since field is persisted and exposed in the API as
current_phase_since. standby/stop were initializing `now` as local time
while create/start/restore use UTC, leaving downstream consumers with
mixed timezone offsets in the serialized value depending on which
transition last occurred. Align all transition sites on UTC. StoppedAt
moves to UTC as a byproduct, which is the correct normalization anyway.
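
The mixed-offset problem is easy to reproduce with the standard library alone. The sketch below serializes the same instant twice, once in a fixed non-UTC zone and once after calling UTC(); the zone and timestamp are arbitrary examples, and marshalSince is a hypothetical helper, not code from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// marshalSince returns the JSON encoding of a timestamp, as a persisted
// or API-exposed time.Time field would appear.
func marshalSince(t time.Time) string {
	b, err := json.Marshal(t)
	if err != nil {
		panic(err)
	}
	return string(b)
}

// demoUTC encodes one instant with a local offset and normalized to UTC.
func demoUTC() (local, utc string) {
	// FixedZone avoids depending on the host's timezone database.
	loc := time.FixedZone("UTC-7", -7*60*60)
	at := time.Date(2026, 5, 11, 13, 0, 0, 0, loc)
	return marshalSince(at), marshalSince(at.UTC())
}

func main() {
	local, utc := demoUTC()
	fmt.Println(local) // "2026-05-11T13:00:00-07:00"
	fmt.Println(utc)   // "2026-05-11T20:00:00Z"
}
```

Both strings denote the same instant, but a consumer that compares or groups the serialized values as text would see them as different, so taking time.Now().UTC() at every transition site keeps the output uniform.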
Comment thread lib/instances/fork.go
Comment thread lib/instances/fork.go
cloneStoredMetadata previously shallow-copied the Phases tracker, which
aliased the Cumulative map between source and forked metadata — a
subsequent Record on either side would mutate both. Add Tracker.Clone
and use it from cloneStoredMetadata.

Also normalise the fork transition timestamp to UTC for consistency
with the other transition sites.
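
The aliasing described above follows from Go's map semantics: copying a struct copies the map reference, not its entries. A minimal sketch, assuming a Tracker with a Cumulative map as in the commit message (the field types and the Clone body are guesses at the general shape, not the merged code):

```go
package main

import (
	"fmt"
	"time"
)

type Phase string

const PhaseRunning Phase = "running"

// Tracker is an assumed shape; only the Cumulative map matters here.
type Tracker struct {
	Current    Phase
	Since      time.Time
	Cumulative map[Phase]time.Duration
}

// Clone returns a copy whose Cumulative map is independent of the source.
func (t Tracker) Clone() Tracker {
	out := t
	if t.Cumulative != nil {
		out.Cumulative = make(map[Phase]time.Duration, len(t.Cumulative))
		for p, d := range t.Cumulative {
			out.Cumulative[p] = d
		}
	}
	return out
}

// demoClone returns, in seconds: the source's running total after a write
// through a shallow copy, the source's total after a write through a
// clone, and the clone's own total.
func demoClone() (srcAfterShallow, srcAfterClone, cloneTotal int64) {
	src := Tracker{
		Current:    PhaseRunning,
		Cumulative: map[Phase]time.Duration{PhaseRunning: time.Minute},
	}

	shallow := src // struct copy: both values share one map
	shallow.Cumulative[PhaseRunning] += time.Minute
	srcAfterShallow = int64(src.Cumulative[PhaseRunning] / time.Second)

	deep := src.Clone() // independent map
	deep.Cumulative[PhaseRunning] += time.Minute
	srcAfterClone = int64(src.Cumulative[PhaseRunning] / time.Second)
	cloneTotal = int64(deep.Cumulative[PhaseRunning] / time.Second)
	return
}

func main() {
	fmt.Println(demoClone()) // 120 120 180
}
```

The write through the shallow copy changes the source (60 s becomes 120 s), while the write through the clone leaves it untouched.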
@sjmiller609 sjmiller609 requested a review from hiroTamada May 11, 2026 20:35
The recording sites previously jumped straight to PhaseRunning the moment
the VMM was up, but the public State machine stays in Initializing until
both ProgramStartedAt and GuestAgentReadyAt are hydrated from the guest
serial log. That meant Phases.Current reported "running" while the API
reported "Initializing".

Make phase tracking honest:
- create/start record PhaseInitializing on VM boot
- restore inspects the preserved markers and records whichever phase the
  guest is actually in (Running in the common case)
- hydrateBootMarkersFromLogs / persistBootMarkers detect the
  Initializing → Running boundary and Record(PhaseRunning) using the
  later marker timestamp, so the accrued Initializing duration matches
  real guest boot time rather than the wall clock when hydration ran

Transient internal substates (Paused/Shutdown inside Standby/Stop)
remain unrecorded — they're sub-ms blips inside non-yielding
orchestration that no external observer can see.
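
The boundary rule in the third bullet can be sketched as follows. The bootMarkers struct and runningBoundary function are hypothetical stand-ins for whatever hydrateBootMarkersFromLogs and persistBootMarkers actually use; the only behavior taken from the commit message is that both markers must be present and the later one is the transition time.

```go
package main

import (
	"fmt"
	"time"
)

// bootMarkers is an assumed shape; nil means the marker is not hydrated yet.
type bootMarkers struct {
	ProgramStartedAt  *time.Time
	GuestAgentReadyAt *time.Time
}

// runningBoundary reports when the instance crossed from Initializing to
// Running: the later of the two markers, once both are known.
func runningBoundary(m bootMarkers) (time.Time, bool) {
	if m.ProgramStartedAt == nil || m.GuestAgentReadyAt == nil {
		return time.Time{}, false
	}
	if m.GuestAgentReadyAt.After(*m.ProgramStartedAt) {
		return *m.GuestAgentReadyAt, true
	}
	return *m.ProgramStartedAt, true
}

// demoBoundary compares the Initializing duration derived from the markers
// with the one that would result from using the time hydration happened to run.
func demoBoundary() (fromMarkersMs, fromHydrationMs int64, ok bool) {
	bootedAt := time.Date(2026, 5, 11, 12, 0, 0, 0, time.UTC)
	programStarted := bootedAt.Add(2 * time.Second)
	agentReady := bootedAt.Add(3 * time.Second)
	hydratedAt := bootedAt.Add(30 * time.Second) // log parsed much later

	boundary, ok := runningBoundary(bootMarkers{
		ProgramStartedAt:  &programStarted,
		GuestAgentReadyAt: &agentReady,
	})
	return boundary.Sub(bootedAt).Milliseconds(),
		hydratedAt.Sub(bootedAt).Milliseconds(), ok
}

func main() {
	fromMarkers, fromHydration, ok := demoBoundary()
	fmt.Println(fromMarkers, fromHydration, ok) // 3000 30000 true
}
```

With these example timestamps, recording the transition at the later marker accrues 3 s of Initializing, whereas recording it when hydration ran would have accrued 30 s.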
…se-tracking

# Conflicts:
#	lib/instances/query_test.go
Comment thread lib/instances/phasetracking/phasetracking.go
Comment thread lib/instances/start.go Outdated
They were never recorded — internal Paused/Shutdown substates happen
inside non-yielding orchestration calls and are intentionally not
tracked (already documented in the package doc).
After a restore from early standby (the instance entered standby before
its boot markers were ever hydrated), Phases.Since is set at restore
time. Markers parsed afterwards can carry timestamps from the
pre-standby boot session, predating Since by the entire standby
interval. Without the clamp, Record would silently skip the
negative-elapsed accrual but still move Since backwards, and every
subsequent transition would then over-count Running. Since this field
feeds billing, clamp forward so Since is monotonic.

Adds a regression test covering the early-standby restore path.
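
The early-standby scenario can be replayed in a few lines. This is a sketch under assumptions: the tracker type, the clamp flag, and the timestamps are invented for the comparison, and the real Record presumably clamps unconditionally.

```go
package main

import (
	"fmt"
	"time"
)

// tracker is a simplified stand-in for the phase tracker.
type tracker struct {
	current    string
	since      time.Time
	cumulative map[string]time.Duration
}

// record enters next at the given time. With clamp set, a timestamp
// earlier than since is moved forward to since so since never goes back.
func (t *tracker) record(next string, at time.Time, clamp bool) {
	if clamp && at.Before(t.since) {
		at = t.since
	}
	if t.cumulative == nil {
		t.cumulative = make(map[string]time.Duration)
	}
	if t.current != "" {
		if elapsed := at.Sub(t.since); elapsed > 0 {
			t.cumulative[t.current] += elapsed
		}
	}
	t.current, t.since = next, at
}

// liveMs returns the total for phase in ms, including live time up to now.
func (t *tracker) liveMs(phase string, now time.Time) int64 {
	d := t.cumulative[phase]
	if t.current == phase {
		if live := now.Sub(t.since); live > 0 {
			d += live
		}
	}
	return d.Milliseconds()
}

// demoClamp replays: boot, standby after 10s (markers never hydrated),
// restore at 10m, then a stale marker from 3s after boot arrives.
// It returns the running total observed at 11m.
func demoClamp(clamp bool) int64 {
	t0 := time.Date(2026, 5, 11, 12, 0, 0, 0, time.UTC)
	var tr tracker
	tr.record("initializing", t0, clamp)
	tr.record("standby", t0.Add(10*time.Second), clamp)
	tr.record("initializing", t0.Add(10*time.Minute), clamp)
	tr.record("running", t0.Add(3*time.Second), clamp) // pre-standby marker
	return tr.liveMs("running", t0.Add(11*time.Minute))
}

func main() {
	fmt.Println("unclamped:", demoClamp(false)) // 657000
	fmt.Println("clamped:  ", demoClamp(true))  // 60000
}
```

Without the clamp the stale marker drags the start of Running back to 3 s after boot, so the whole standby interval is counted as running; with it, only the minute since restore is counted.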
@sjmiller609 sjmiller609 requested a review from hiroTamada May 11, 2026 22:32
Comment thread lib/instances/phasetracking/phasetracking.go Outdated

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit a29fd64.

Comment thread lib/instances/query.go
@sjmiller609 sjmiller609 merged commit 23e332a into main May 12, 2026
11 checks passed
@sjmiller609 sjmiller609 deleted the hypeship/instance-phase-tracking branch May 12, 2026 12:39